Skip to content

fix(io): generalize PR #54 write-path contract to non-TLS sites (P3)#12

Open
andypost wants to merge 1 commit into
masterfrom
p3-write-path-non-tls
Open

fix(io): generalize PR #54 write-path contract to non-TLS sites (P3)#12
andypost wants to merge 1 commit into
masterfrom
p3-write-path-non-tls

Conversation

@andypost
Copy link
Copy Markdown
Owner

@andypost andypost commented May 8, 2026

Summary

Phase 3 (Pattern D′) of the graceful-shutdown / graceful-reload plan documented in roadmap/plan-graceful-shutdown.md on the roadmap branch.

Audits the five upstream TODO/FIXME sites flagged for the silent fall-through on write paths bug family and applies PR nginx#54's contract where it matters. Two are real NULL-deref bugs that needed code fixes; three are documentation-only dispositions.

Real fixes — NULL-deref on transient port mem_pool OOM

Both sites in src/nxt_port_socket.c had if (b == NULL) {} with an empty body, then dereferenced b->mem.pos immediately after. Strict improvement: crash → orderly teardown.

src/nxt_port_socket.c:746nxt_port_read_handler

b = nxt_port_buf_alloc(port);
if (nxt_slow_path(b == NULL)) {
    nxt_alert(task, "...buf alloc failed; disabling read", ...);
    nxt_fd_event_block_read(task->thread->engine, &port->socket);
    goto fail;   // existing label at :802 queues nxt_port_error_handler
}
nxt_assert(b != NULL);  // invariant for future audits

src/nxt_port_socket.c:889nxt_port_queue_read_handler

Same shape, plus nxt_atomic_fetch_add(&queue->nitems, -1) to balance the +1 at function entry (:830), then queue nxt_port_error_handler and return.

Recovery via timer-driven retry was rejected as out of P3 scope; terminal teardown is the strict improvement over the pre-P3 NULL-deref.

Documentation-only dispositions

Site What was there What it is now
src/nxt_port_socket.c:1369 nxt_port_error_handler bare /* TODO */ Disposition note. Audit found the handler already drains queued sends, releases buffers, decrements port refcounts. No silent-success or NULL-deref bug. Richer error responses are a future enhancement, not part of PR nginx#54's contract.
src/nxt_router.c:5895,5917 nxt_router_get_mmap_handler two // FIXME markers on best-effort RPC_ERROR sends One-line disposition note + explicit (void) cast matching the convention at the end of the same function. Function is void, delivery failures surface in port->socket.error, port-side teardown fires, peer-side RPC timeout is the backstop.

Tests

No new tests. The NULL-deref sites cannot be reliably reproduced from Python without malloc-failure injection (LD_PRELOAD shim or AddressSanitizer with allocator hooks); ASan validation is recommended in CI follow-up.

The two nxt_assert(b != NULL) lines after the OOM-teardown blocks provide deterministic in-tree documentation of the invariant in lieu of a behavioural test — a future regression that re-introduces an empty if (b == NULL) {} body would nxt_assert on the next iteration.

Verification

./configure --tests --modules=python && ./configure python --config=python3-config && make -j$(nproc)
                                                                               # clean
python3 -m pytest test/test_idle_close_wait.py            # 2 pass
python3 -m pytest test/test_python_application.py         # 40 pass, 6 skip

Independence

This PR does not depend on PR #7 (P1) or PR #11 (P2). It branches off master and the touched files (nxt_port_socket.c, nxt_router.c:5895-5927) are disjoint from both. Any merge order works.

Out of scope

  • Recovery from transient port mem_pool OOM via timer-driven retry (would need new timer infrastructure).
  • Pre-existing nitems accounting on suspend-message error paths in nxt_port_queue_read_handler (caught during this PR's audit but not introduced by it; filed under the port IPC leak discussion in PR fix(port): plug mp-pool retain and fd/buffer leaks in IPC reply paths #8).
  • Anything else in roadmap/plan-graceful-shutdown.md Pattern D — that's P1/P2/P4/P5/P6/P7.

Refs


Generated by Claude Code

…(P3)

Phase 3 of roadmap/plan-graceful-shutdown.md (Pattern D' on the
`roadmap` branch).  Audits the five upstream TODO/FIXME sites
flagged for the "silent fall-through on write paths" bug family
and applies PR nginx#54's contract where it matters.  Of the five,
two are real NULL-deref bugs that needed code fixes; three are
documentation-only dispositions.

Real fixes
----------

src/nxt_port_socket.c:746  nxt_port_read_handler
  The empty `if (b == NULL) {}` body fell through to
  `iov[1].iov_base = b->mem.pos`, NULL-derefing on transient OOM
  in the port mem_pool.  Now alerts, disarms the read event via
  nxt_fd_event_block_read, and `goto fail` (the existing fail:
  label at :802 queues nxt_port_error_handler).  Strict
  improvement: crash -> orderly teardown.  Recovery via
  timer-driven retry was rejected as out of P3 scope.

src/nxt_port_socket.c:889  nxt_port_queue_read_handler
  Same NULL-deref class -- `b->mem.pos` was used in both the
  dequeue memcpy and the iov[1] recv branch.  Mirror the fix:
  alert, disarm, decrement queue->nitems (balances the entry
  +1 at :830), and queue nxt_port_error_handler.

Both sites now carry an `nxt_assert(b != NULL)` immediately past
the OOM-teardown block, documenting the invariant for future
audits and providing a deterministic guard against a future
regression that re-introduces an empty if-body.

Documentation-only dispositions
-------------------------------

src/nxt_port_socket.c:1369  nxt_port_error_handler
  The bare `/* TODO */` historically asked for richer error
  context (e.g. surfacing socket.error to peers).  Audit found
  the handler already drains queued sends, releases buffers,
  and decrements port refcounts -- no silent-success or
  NULL-deref bug hidden here.  Replaced bare TODO with a
  disposition note; richer error responses are a future
  enhancement and not part of PR nginx#54's write-path contract.

src/nxt_router.c:5895,5917  nxt_router_get_mmap_handler
  Two `// FIXME` markers on best-effort RPC_ERROR sends.  The
  function is `void`, delivery failures surface in
  port->socket.error and the port-side teardown
  (nxt_port_error_handler) fires; peer-side RPC timeout is the
  ultimate backstop.  Replaced FIXMEs with a one-line
  disposition note and added explicit `(void)` casts to match
  the convention at the end of the same function.

Tests
-----

No new tests.  The NULL-deref sites cannot be reliably
reproduced from Python without malloc-failure injection
(LD_PRELOAD or AddressSanitizer with allocator hooks); ASan
validation is recommended in CI follow-up.  The
nxt_assert(b != NULL) lines provide deterministic in-tree
documentation of the invariant in lieu of a behavioural test.

Verified
--------

  ./configure --tests --modules=python && ./configure python \
    --config=python3-config && make -j$(nproc)              # clean
  python3 -m pytest test/test_idle_close_wait.py            # 2 pass
  python3 -m pytest test/test_python_application.py         # 40 pass, 6 skip

Refs
----
- roadmap/plan-graceful-shutdown.md P3 (Pattern D')
- PR nginx#54 (TLS write-path contract; the canonical shape mirrored
  by the two real fixes above)
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves error handling in src/nxt_port_socket.c by addressing potential NULL pointer dereferences during buffer allocation failures. It also updates code comments and explicitly handles return values in src/nxt_router.c. The review feedback suggests using specific format specifiers (%PI and %uD) for PID and ID types in log messages to improve type safety and remove unnecessary casts.

Comment thread src/nxt_port_socket.c
Comment on lines +761 to +763
nxt_alert(task, "port{%d,%d} %d: buf alloc failed; "
"disabling read",
(int) port->pid, (int) port->id, port->socket.fd);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The format specifiers in the nxt_alert call should be updated for correctness and consistency:

  • %PI should be used for nxt_pid_t (port->pid).
  • %uD should be used for uint32_t (port->id) to avoid potential overflow when casting to a signed int.

This also removes the need for casting the arguments.

            nxt_alert(task, "port{%PI,%uD} %d: buf alloc failed; "
                            "disabling read",
                      port->pid, port->id, port->socket.fd);

Comment thread src/nxt_port_socket.c
Comment on lines +922 to +924
nxt_alert(task, "port{%d,%d} %d: buf alloc failed; "
"disabling read",
(int) port->pid, (int) port->id, port->socket.fd);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The format specifiers in the nxt_alert call should be updated for correctness and consistency:

  • %PI should be used for nxt_pid_t (port->pid).
  • %uD should be used for uint32_t (port->id) to avoid potential overflow when casting to a signed int.

This also removes the need for casting the arguments.

            nxt_alert(task, "port{%PI,%uD} %d: buf alloc failed; "
                            "disabling read",
                      port->pid, port->id, port->socket.fd);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants